TrEMOLO: accurate transposable element allele frequency estimation using long-read sequencing data combining assembly and mapping-based approaches

Transposable Element MOnitoring with LOng-reads (TrEMOLO) is a new software that combines assembly- and mapping-based approaches to robustly detect genetic elements called transposable elements (TEs). Using high- or low-quality genome assemblies, TrEMOLO can detect most TE insertions and deletions and estimate their allele frequency in populations. Benchmarking with simulated data revealed that TrEMOLO outperforms other state-of-the-art computational tools. TE detection and frequency estimation by TrEMOLO were validated using simulated and experimental datasets. Therefore, TrEMOLO is a comprehensive and suitable tool to accurately study TE dynamics. TrEMOLO is available under GNU GPL3.0 at https://github.com/DrosophilaGenomeEvolution/TrEMOLO. Supplementary Information The online version contains supplementary material available at 10.1186/s13059-023-02911-2.

a newly assembled genome is a hard and complex task. Many tools have been developed to tackle this issue (e.g., REPET [7], RepeatMasker [8], EDTA [9]), with more or less success, and requiring all a large amount of computational resources. Moreover, tracking the presence or absence of a given insertion among individuals or through generations was, up to recently, limited to heavy wet-laboratory approaches, with a low throughput [10][11][12].
Sequencing technology advances in the last 15 years, especially Illumina short-read sequencing, have allowed sequencing, on a daily basis, not only individuals within a species or subspecies, but also within whole populations. Nowadays, it is not uncommon to sequence the same population many times over generations to study its genetic evolution [13]. One of the main findings provided by the availability of massive sequencing data is that TEs are the most variable component of genomes, even more than previously suspected [14]. Therefore, many tools have been developed to use short-read sequencing data to identify potential TE polymorphisms among samples (e.g., T-lex [15][16][17]; McClintock [18]) or within a population (e.g., DNApipeTE [19]; PoPoolationTE2 [20]). They rely on three main mapping features: split reads, unpaired reads, and depth of coverage [14]. However, short-read sequencing data are not suitable for the optimal detection of structural variations (SV), especially SV due to TE insertion/deletion. Indeed, as short-reads are shorter (100 to 250 bases) than the mean length of repeats and TEs, they do not span the entire length of the variation and thus cannot be used to correctly identify SV, leading to high false-positive rates (40%) and false-negative results [21,22]. Conversely, Long Read (LR) sequencing technologies, such as Pacific Biosciences and Oxford Nanopore Technologies (ONT), allow obtaining reads that are >20kb long (and even 100kb and more for ONT). Therefore, a single read may contain a complete SV or TE insertion and enough surrounding sequence data to unambiguously anchor the variation. However, the longer read length is counterbalanced by a lower sequencing quality (still improving) compared with short-read sequencing and fewer dedicated tools for TE detection. The current methods to detect TE variations using LR sequencing data (and also most of the short-read sequencing data-based tools) generally rely on the presence/ absence of TE insertions annotated in a reference genome (e.g., LorTE [23]), or on the detection of abnormally mapped reads upon the reference genome that can be linked to a TE (e.g., TELR [24]). However, some tools also allow the partial detection, assembly and annotation of non-reference TE insertions (e.g., TLDR) (Transposons from Long DNA Reads) [25].
In addition, unlike short-read, LR sequencing allows generating high-quality genome assemblies with higher levels of contiguity, and consequently to compare genomes directly and to extract more efficiently large and complex SVs [26]. Indeed, the number of available almost complete genomes is increasing due to these LR sequencing technologies [13,27], and even telomere-to-telomere assemblies can be obtained [28]. Moreover, some reliable tools are available for genome-to-genome comparisons (MUMmer [29], minimap2 [30]) and for identifying complex SV between two assemblies more efficiently than with mapping-based approaches. However, assembly methods generally rely on a layout-and-consensus step when solving the multiple possibilities at a given locus (i.e., choosing one of all possible haplotypes in the graph, usually the most represented one), and thus may mask some minor populational/somatic variations.
In this paper, we present the Transposable Element MOvement detection using LOngreads (TrEMOLO) software that combines the advantages offered by LR sequencing (i.e., highly contiguous assembly and unambiguous mapping) to identify TE insertion (and deletion) whatever the frequency at which they segregate in a population. TrEMOLO (available at https:// github. com/ Droso phila Genom eEvol ution/ TrEMO LO) is a Snake-Make pipeline that can be used in script mode with dependencies already installed, or within a singularity container to ensure a higher reproducibility and to facilitate future updates. The pipeline is designed so that the inclusion of additional samples is straightforward. We demonstrated its efficiency using simulated and experimental LR sequencing data from a highly polymorphic Drosophila population. TrEMOLO detected TE insertions that could not be identified with other approaches, with a very low error rate. Therefore, TrEMOLO provides an unprecedented accurate picture of TE variability compared with other tools.

TrEMOLO design
TrEMOLO is a suite of Python 3 [31] scripts that follow the SnakeMake [32] rules. It takes as input a TE library, a reference genome sequence, single or pooled LR DNA sequencing data, and a genome assembly generated from these sequencing data. In the sequel, TE insertions from the major haplotype belonging to the assembly will be called INSIDER TE insertions. The minor frequency haplotype TE insertions are not included in the assembled genome and will be called OUTSIDER TE insertions. The complete tool includes three internal modules ( Fig. 1 and Additional file 1: Fig. S1): (1) the INSIDER TE detection module identifies TE insertions and deletions present only in the newly assembled genome when compared to the reference genome, by aligning these two genomes and parsing the pairwise alignment; (2) the OUTSIDER TE detection module searches for the additional TE insertions (those that are carried by rare haplotypes and are therefore not present in the assembly), by mapping, LRs to the genome assembly generated using the same data, and detecting TE variants by parsing the mapping results ( Fig. 1 and Additional file 1: Fig. S1); (3) the TE analysis module computes from the previous modules outputs the frequency of each TE variant using the LR sequencing data and looks for target site duplications (TSDs) that occur upon TE integration as a result of staggered DNA double-strand breaks [33]. A detailed manual on how to run TrEMOLO with all its options is available at TrEMOLO github (https:// github. com/ Droso phila Genom eEvol ution/ TrEMO LO).

The INSIDER TE detection module
This module detects the so-called INSIDER TEs. These are the non-reference TE insertions/deletions that are found frequently enough among the reads to be incorporated in the assembly (generally the TE polymorphisms of the major haplotype). It takes as input a TE library, a reference genome, and the genome assembly of the studied biological sample, all in fasta format. This assembly must be generated with the input LR sequencing data to be compatible with the second module. Whole-genome pairwise alignment is performed between the reference and assembled genomes using minimap2 [30], and is then parsed with Assemblytics [34] for SV identification (Additional file 1: Fig. S1). This analysis returns a list of SVs from the pairwise alignment of the assembly against the reference genome with their positions. This list is then compared with a TE library to keep only SV belonging to already known TE families ( Fig. 1, left panel, and Additional file 1: Fig. S1). Before providing the final list of detected TE insertions/deletions, the alignment data are scanned to filter out sequences that do not fit the parameters (considered as low-quality TEs), corresponding here to 80% of the size and 80% of identity ("Methods").

The OUTSIDER TE detection module
This module allows to retrieve OUTSIDER TEs, i.e. those TE polymorphisms that are low frequency in the sample and not included in the genome assembly. This module takes as input a TE library in fasta format, a genome assembly (fasta), and the LR sequencing data in fastq used to build this assembly. The reads are aligned to the assembled genome using minimap2 [30]. Reads that are flagged as partially or nonmapping in the output SAM file are candidates for carrying OUTSIDER TEs. Candidate reads are validated if the corresponding sequences are present in the provided TE library ( Fig. 1 right panel, Additional file 1: Fig. S1). As for the INSIDER module, low-quality TEs are filtered out and then, TE insertions are listed in a BED file.

The TE analysis module
This module is dedicated to the analysis of the newly inserted TEs (a) to detect hallmarks of the transposition mechanism (i.e., TSDs) and (b) to estimate the allele frequency of each TE insertion.
(a) For the INSIDER TE insertions, the 30 bp flanking both extremities of the TE insertion are aligned against each other using the Boyer-Moore exact algorithm [35] to detect short motifs shared on each side of the insertion without mismatches. For the OUTSIDER TE insertions, the most informative single LR (with the higher BLAST score) spanning the TE insertion is selected. Then, like for INSIDER TE insertions, the sequences flanking the TE insertion site are extracted and aligned. However, to handle the LR error rate, the algorithm can accept mismatches (default 1). If the putative TSDs are located more than 2bp away from the detected 5′ and 3′ ends of the TE insertions, they are filtered out because it could be a simple duplication present in the genome. In addition, for OUTSIDER TE insertions, if the putative TSD sequence is not found as a unique sequence at the putative insertion locus in the assembled genome, it is also filtered out and not considered as the sequence of the empty site. Finally, retained TSDs allow to better define the 5′ and 3′ ends of the inserted TE. (b) To compute the insertion frequency, the number of reads with the TE insertion is divided by the total number of reads spanning the same position, either with or without TE.

OUTPUT data
Several BED files are produced containing information on the inserted or deleted TE insertions, their nature, and their location on the reference and assembled genomes, as well as the computed frequencies for each of them.

Using simulated datasets to benchmark the ability of TrEMOLO to annotate new TE insertions on a reference genome
We compared the results obtained by TrEMOLO, TELR [24], and TLDR [25], two recently developed computational tools dedicated to TE detection using LR sequencing data. The rationale of these tools is similar to the one used with short-read sequencing data: they use LR data mapped to a reference sequence, through a BAM file for TLDR [25], or directly from the mapping tool output for TELR. Insertions are detected based on discordant mappings (i.e., clipped reads and partially mapped reads). Finally, as for short-read-based approaches, the non-mapped parts of the reads are aligned against a library of canonical TE sequences. Then, discordant mapped reads associated with TE matches are selected, clustered, and analyzed to validate or invalidate the TE insertion sites in the reference genome. We generated a simulated reads dataset (sample S1, see Methods and Additional file 1: Fig. S2) in which 1100 TE insertions with different allele frequencies were added to the D. melanogaster unmasked genome called G0-F100 previously published [13].
For that, we first generated the LR dataset with DeepSimulator on G0-F100 (sb1 at a depth of 10×). Then we used a G0-F100 pseudo genome containing 100 TE insertions to simulate the so-called INSIDER TE sequencing dataset (sb2 at a depth of 45×). To obtain the OUTSIDER TE simulated sequencing dataset (sb3 at a depth of 5×), we launched DeepSimulator on the G0-F100 INSIDER pseudo genome into which we had inserted 1000 additional TEs. Finally, the three LR datasets were mixed together (sb1+sb2+sb3) to obtain the sample S1 (see the "Methods" section and Additional file 1: Fig. S2; simulated data are available at https:// doi. org/ 10. 23708/ N447VS [36,37] for the reads and https:// doi. org/ 10. 23708/ DSDTZ0 for the genome [38]).
We mapped this simulated reads dataset to the unmasked G0-F100 genome using TELR [24], TLDR [25], and the TrEMOLO OUTSIDER module (here called "TrEM-OLO_NA", see the "Methods" section) with minimap2 [30]. We also launched the whole TrEMOLO pipeline (both OUTSIDER and INSIDER modules) with the G0-F100 genome as reference and as the assembled genome, we used the G0-F100 genome with the 100 highly frequent TE insertions. As the TLDR parameters are fixed [25], we decided to use the same parameters for the four methods/tools (TrEM-OLO, TrEMOLO_NA, TLDR, and TELR): 200 bp as minimum TE insertion length and 80% as minimum TE sequence identity. To limit false-positive TE detections, we also set at 15kb the largest TE insertion length.
Using these simulated datasets, we observed that out of the 1100 simulated insertions, TELR and TLDR only detected 12 (1%) and 928 (84.4%) TE true-positive insertions, respectively. By contrast, TrEMOLO_NA found 1064 (96.7%) TE true-positive insertions, illustrating the greater sensitivity of the TrEMOLO mapping algorithm (Fig. 2). Unlike TLDR and TELR tools, TrEMOLO also relies on LR-based genome assemblies, which enables it to retrieve and use the information of the most frequent TE polymorphisms. When using a complete TrEMOLO approach, i.e., by taking advantage of the comparison of two whole-genome assemblies, the detection rate even rose up to 97.3%.
Moreover, only 80 (7%) and 95 (8%) TE false-positive insertions were detected in TrEMOLO_NA and TrEMOLO respectively, which is the same order of magnitude as Fig. 2 TrEMOLO benchmarking using simulated LR sequencing data. The number of true-positive (true (+)), false-positive (false (+)), and false-negative (false (−)) results obtained using the simulated data (1100 known TE insertions) and four methods/tools (TELR, TLDR, TrEMOLO_NA, and TrEMOLO) the TLDR false-positive insertion rate (11%), and remains very low compared to the significant gain of true-positive insertions detected by TrEMOLO.

TrEMOLO experimental validation
To validate TrEMOLO insertion/deletion detection accuracy with real biological samples, we took advantage of a previously described D. melanogaster laboratory line [39] in which the knockdown of Piwi, a protein essential for TE silencing, can be induced in adult follicle cells (somatic cells of the ovary). This transient somatic Piwi knockdown allows TE derepression in follicle cells and their integration as new proviruses in the progeny genome [39]. We performed TE derepression for 73 successive generations to obtain the G73 line that contains thousands of newly integrated TEs in its genome [13,39]. We used genomic DNA extracted from G0-F100 (before Piwi knockdown) and G73 (G73-SRE; see Methods) fly samples to check by genomic PCR, 12 of the 2334 newly integrated TEs detected by TrEMOLO in a previous G73 sequencing dataset (G73 depth 183×) [13]. This small sample is representative of both OUTSIDER (mostly from the gtwin and ZAM families) and INSIDER TE insertions, i.e. insertions detected by TrEM-OLO in either a minority or a majority of the initial G73 (183×) LRs, respectively. Using G73-SRE DNA, genomic PCR yielded amplicons of the expected sizes, whereas no amplification was observed when the G0-F100 reference genomic DNA was used as a Fig. 3 Experimental validation of TrEMOLO specificity of TE insertion/deletion detection. a PCR performed using long DNA molecules extracted from the G0-F100 (control) and G73 Drosophila lines (genomic DNA extracted using Short-Read Eliminator kit: G73-SRE sample). The name of the TE family identified by TrEMOLO is indicated followed by the chromosome arm and the genomic coordinate in the G73 (183×) assembly. The frequencies determined in the G73 (depth 183×) of each TE insertion tested by PCR are indicated as percentages in blue. b Illustration of the primers used to detect a hobo excision. c PCR performed using long DNA molecules extracted from the G0-F100 (control) and G73 Drosophila lines. The primer sequences and amplicon expected sizes are in Additional file 1: Table S1 template (Fig. 3a, upper panels and lower left panel). Moreover, to control the DNA sample quality, we also checked the presence of four TEs shared by both genomes (Fig. 3a, lower right panel). We also experimentally validated the excision of a hobo DNA transposon in G73 that was present in the G0-F100 genome (Fig. 3b, c). Altogether, these data illustrated the efficacy of TrEMOLO to detect both insertions and excisions whatever their frequency in the population.

Impact of the genome assembly quality on TrEMOLO efficiency
To assess the impact of the genome assembly quality on TE insertion/deletion detection, we ran TrEMOLO using the G73 genome assembled with different tools and after different polishing rounds, using the G0-F100 assembly as reference. We used Flye [40], Shasta [41], and Wtdbg2 [42] followed by Racon [43] polishing steps and a scaffolding by RaGOO [44] ("Methods"). As previously reported [13], four rounds of Racon polishing [43] increased the BUSCO score (Table 1), suggesting a better-quality assembly.
Using genomes assembled with the recently updated Flye and Shasta assemblers led to the detection of similar numbers of INSIDER and OUTSIDER TEs, with a slightly higher number of INSIDER TEs for the Shasta-assembled genome (Table 1). When using a genome assembly generated with Wtdbg2, which has not been updated for two years and thus has not incorporated the recent changes in ONT sequencing data (e.g., new base calling inducing a much lower error rate) and has a lower BUSCO score (indicating a lower quality of assembly), TrEMOLO detected roughly the same number of OUT-SIDER TE insertions as the more recent assemblers. Note however, that the INSIDER module produced at least twice as much insertions and deletions (among which many of them were probably false positives) probably because it was based on a low-quality assembly.
Altogether, we conclude that a draft genome assembly, even obtained with a not-sorecent assembler version (with a lower BUSCO score indicating a lower assembly quality), is of enough quality however to allow the identification of TE polymorphisms by TrEMOLO.

Validation of TE frequency estimated by TrEMOLO using simulated data
To validate the accuracy of TrEMOLO frequency estimation, we used simulated datasets with known frequencies of INSIDER and OUTSIDER TEs (see the "Methods" section; Fig. 4a). The frequency is computed by the number of reads supporting the insertion divided by the total number of reads spanning the region of insertion. For OUTSIDER TEs, we count as supporting reads, the reads with the complete insertion as well as the clipped ones (Fig. 4a). TrEMOLO frequency module slightly overestimated the frequency of INSIDER TEs for the simulated datasets (Fig. 4b). For OUT-SIDER TEs, whatever is the frequency, TrEMOLO obtained an accurate estimation of the frequency (Fig. 4c).

Experimental validation of TE frequency estimated by TrEMOLO using droplet digital polymerase chain reaction
To validate the TrEMOLO frequency module accuracy, we re-extracted long DNA molecules from G73 flies (Methods) and analyzed the extracted DNA by ONT sequencing with Short-Reads Eliminator kit (G73-SRE library) and by droplet digital polymerase chain reaction (ddPCR). This latter approach allows the absolute quantification using forward primers located in the left side of the TE landing site and reverse primers located either in the TE or on the right side of the landing site (Fig. 5). To increase the detection specificity, we also designed fluorescent probes. The primer and probe sequences are listed in Additional file 1: Table S2. We selected nine of the 2234 TE insertions, as representatives of different TEs families and of frequencies in the initial G73 library (depth of 183×) [13]. We ran TrEMOLO (parameter: 80% as minimum TE sequence identity and  (Table 2). For two of them, the low frequencies estimated by ddPCR (0.6% and 2.2%) do not allow an accurate estimation of their frequency by TrEMOLO. In the G73-SRE library (depth 82×) there are no reads supporting the ZAM-3L_13591033 insertion, while we detected it using the initial library G73 that has shorter reads and a higher depth (183x). Concerning the blood-2L_18128567 insertion, the TrEMOLO frequency estimation is not congruent with the ddPCR one. This overestimation of the frequency by TrEMOLO (37.68%) illustrates that TrEMOLO estimation starts to be accurate when the insertion TE frequency is around 10% in the population. Indeed, for the other six TE insertions tested, the ddPCR estimates fitted very well with the TrEMOLO estimates, whatever their frequency from 10 to 82%.

Impact of sequencing depth on TE calling and frequency estimation
The total amount of the initial G73 sequencing is averaging a depth of 183×, fairly representative most of the 200 haploid genomes of the 100 males used for the DNA extraction. Indeed, the probability of sequencing one out of the 200 haplotypes from this sample (i.e. the depth per haplotype) was (1:200)×183=0.91. From this dataset, we generated several sequence subsets using Filtlong [45] (see Methods), resulting in seven subsamples of reads (ranging from 145× to 15× of depth) (Additional file 1: Table S3). We then launched TrEMOLO with these datasets, and compared the  number of OUTSIDER TE insertions detected by mapping these reads to the G73 assembly (Additional file 1: Table S3 and Fig. 6). Using the initial dataset G73 (183×), TrEMOLO detected 2334 TEs of which 2274 are OUTSIDER TE insertions. The percentage of detected OUTSIDERs is relative to this dataset. Decreasing the depth from 183x to 76x slightly reduced the number of OUTSIDER TEs detected. Then, TrEMOLO detection performance decreased more rapidly. Nevertheless, even with the lowest depth (15×), TrEMOLO could still identify 47% (1064) of OUTSIDER TE insertions. As expected, these insertions were significantly more frequent in the 183× dataset than the 1210 ones that were lost during the down-sampling (Additional file 1: Fig. S3, p=0,05 Wilcoxon rank test). Thus, in the case of highly polymorphic populations, such as ours, we would recommend a sequencing depth corresponding to a minimum of 0.7× per haplotype (corresponding here to our 145× dataset) to ensure an optimal recovery of low-frequency transpositions.
Among the 2334 original TE insertions, only 820 OUTSIDER TEs were detected in all subsamples (Fig. 7). To test the impact of the sequencing depth on the accuracy of frequency estimates, we computed the frequencies of these 820 shared TE insertions in each subsample. The accuracy of OUTSIDER TE insertion frequency estimation was acceptable when using datasets with sequencing depth ≥76×, with a decrease of 12% maximum compared with the initial frequency computed at 183× (Fig. 7). However, for very "low depth" datasets, (here 0.19× per haplotype or less), the accuracy Fig. 6 OUTSIDER TEs detection depends on depth of sequencing. G73 sequencing reads averaging a 183× depth were subsampled and the OUTSIDER TrEMOLO module was run to detect OUTSIDER TE insertions at each depth frequency decreased quite rapidly with an almost systematic underestimation by up to 70% compared with the original dataset.

Discussion
Accurate detection of TEs with different frequencies in a population or in a single sample at the somatic level is important for researchers in evolution, ecology, agronomy, and medicine. However, it is still a challenge due to the repetitive nature of the TEs. Most of the current approaches to identify TE insertions are based on short-read mapping to a reference genome, sometimes quite distant from the studied samples. Working with distantly related genomes may drastically bias the mapping results and thus the subsequent analyses, for instance by providing high false positive rates [46]. Therefore, TE detection tools based on mapping approaches allow the identification of low-frequency TE insertions (or from the non-major haplotype), but not all of them, only when using very high-depth short-read sequencing data. The availability of low-cost LR sequencing technologies (ONT) opens the possibility to accurately detect new TE insertions. Improving the sequencing quality allows to obtain accurate and (almost) complete genome assemblies in a short amount of time for small-sized ones. For instance, a high-quality Drosophila genome can be assembled in 24 hours [47]. However, most of these assemblers provide a single haplotype sequence as output, generally the most prevalent. Finally, most TE detection tools rely on the identification of TE fragments as a source of SV ab initio, through a global genome annotation without any comparison with other already annotated sequences. We designed TrEMOLO, a comprehensive computational tool with the aim of combining the main haplotype information (assembly step) followed by the recovery of minor haplotypes by LR sequence mapping to the newly assembled genome to detect all SV types without ab initio TE detection. It is only after this naive detection that the SVs are annotated as possible TEs or not. This approach allows to run TrEMOLO using different TE databases. TrEMOLO accuracy in TE identification and the correction system used for TSD detection allow predicting the insertion site within a 2-base pair window (provided, of course, that the extremities of the TE family are precisely defined in the consensus sequence). Therefore, TrEMOLO is currently one of the most efficient tools for TE detection. Moreover, TrEMOLO allows the estimation of the insertion frequency for each detected TE copy with high accuracy (when using data with sequencing depth >50×). Importantly, TrEMOLO performance is weakly influenced by the assembly quality and sequencing depth variations (up to a certain point). In terms of parameters, TrEMOLO in its standard approach will use the previously published 80/80/80 rule for TE recovery [1]. Indeed, it relies on 80% of similarity to and 80% in length of the reference TE sequence. However, as shown in the tests before, we could decrease the length coverage down to 10% and still recover true positive variations. This length parameter can thus be modified for, e.g., identifying partially transposed elements (for insertions of 5′-truncated LINEs), or insertions of abnormal copies (internally deleted TEs still able to transpose). In the same way, decreasing the level of similarity would allow to detect more distantly related copies from the consensus/reference copy in the database. However, in this case, the decrease of the stringency of detection will have a stronger effect on the quality of results (improving the false positive rates) [48], and we do not recommend it. The best approach would be to add in the reference TE file the targeted sequences instead. Finally, a limit of our tool is based on the read size vs TE size: indeed, as 80% of the target TE sequence must be covered, reads have to be quite long. Therefore, with an average TE size of 4.7kb (for our TE database here), we recommend a minimal read size of 4kb if possible to ensure a correct recovery of new insertions.
In terms of execution times and memory usage, it depends a lot on the considered species. However, as an example, the analyses presented here for simulated data were run on a single laptop computer with 16 Gb of RAM and a 12-core i7 Intel processor, in less than 10 hours, and thus can be deployed in almost any lab.
Therefore, TrEMOLO can be used for different projects (e.g., pooled DNA sequencing, population analysis, human genetics, tumor analysis) that require the fast and precise identification of TE insertions, even at very low frequency within a sample. Indeed, our data showed that a single insertion can be unambiguously identified through a unique read, as long as the read is long enough to harbor the whole TE and its surrounding insertion site.

Conclusions
TrEMOLO allows the precise analysis and study of haplotype-specific TE insertions and a better understanding of the true impact of TE evolution in the wider context of their host genome.

INSIDER TE identification
to compare the reference genome and a new assembly, TrEMOLO uses the same methodology as RaGOO (v1.1) [44] up to the SV identification step. Then, SVs are compared to the TE database (Bergman's laboratory, https:// github. com/ bergm anlab/ droso philatrans posons/) with BLASTN (2.9.0) (--outfmt 6) [49]. The output file is filtered based on the size and identity percentages listed in parameters (80% in standard for each of them). DNA fragments that appear to be part of the same sequence are grouped. For INSER-TION, Repeat_expansion, Tandem_expansion, sequences are retrieved from the query (new assembly). For DELETION, Repeat_contraction, Tandem_contraction, sequences are recovered from the reference (Additional file 1: Fig. S1)

OUTSIDER TE identification
the LRs are aligned on the genome resulting from their assembly using minimap2 (v.2.24) [30] and SVs are identified using Sniffles [50]. Reads covering the SV are recovered from the SAM/BAM file using as anchor point the Sniffles provided position and a 30-bp window with SAMtools [51]. The SV reads and sequences are compared to the TE database, and filtered as done for INSIDER TEs, but only the best support reads are kept among those provided by Sniffles or the SAM parsing (Additional file 1: Fig. S1).

TSD detection
both TE flanking sequences are recovered to check whether there are duplicate sequences at the TE junction (see the "Results" section for more details). For an OUT-SIDER TE, if TrEMOLO detects duplicated sequences farther than the initially identified TE junctions, and if they fit to the TSD, then the OUTSIDER TE position is corrected accordingly.

TE frequency estimation
two different approaches are used to estimate the INSIDER and OUTSIDER TE frequencies in the samples. For INSIDER TEs, from the original BAM file only reads aligned to the putative TE positions (with a 100-bp window 5′ and 3′) are retained using SAMtools view and the -F 2048 -L option (removing any reads with these FLAGS value (see SAM specifications) and overlapping a provided BED file, respectively). Then, clipped reads are removed to keep only the reads aligned on the putative TE position within a 5-bp window (5′ and 3′). The mean depth of 30bp at the 5′ and 3′ of the putative insertion is computed with SAMtools depth. Then, the TE frequency is computed as (number of reads showing the insertion)/(total number of reads).
For OUTSIDER TEs, first, reads not aligned to the putative TE insertion site are removed from the original BAM file (SAMtools view with the -l option). Then, the clipping tags in the CIGAR strings for the remaining reads are modified by adding an insertion tag of the TE size plus 10 bp to the read sequence (i.e., the same number of "N" at the corresponding position with a fake quality of 8). Then, Sniffles is run again to obtain the true number of reads associated with a given TE. After recovering the local depth, as done for INSIDER TEs, frequency is computed as (number of reads showing the insertion)/(total number of reads).

Genome assembly
Raw ONT reads for the G73-SRE sample were QC checked using Nanoplot v1.10.2 (https:// github. com/ wdeco ster/ NanoP lot) for sequencing run statistics. Reads with QC <7 were removed by the sequence provider (Montpellier Genomix, Montpellier, France) before QC. For all ONT sequencing data, the genome was assembled as previously published [13], except that a newer version of Flye (v2.9) was used.
Benchmark of Drosophila genome assemblies effect 2 Genome assemblies were performed using the raw LR data and CulebrONT for assembly and polishing [53]. Flye v2.9 [40] was used with standard parameters except for the --nano-raw option. Four rounds of Racon v1.4.3 with minimap2 v.2.24 [30] were performed using the standard conditions for ONT data, and with medaka v1.2 (standard conditions). Scaffolding was performed using RaGOO v1.1 [44] with standard options except for output structural variations with the -s option and the corresponding reference. For raw assemblies, Wtdbg2 v2.5 [42] and Shasta v0.1.0 [41] were used (standard conditions).

Impact of sequencing depth on TE calling and frequency estimation
A previously published G73 long read sequencing library [13,54] was downsized to smaller samples using FiltLong [45] and the following options: -t 19000000000, -t 15000000000, -t 12000000000, -t 10000000000, -t 5000000000, -t 2000000000). This subsampling resulted in a series of sequencing datasets with more or less degraded depths ranging from 183x to 15x.

Genomic DNA extraction and Short-Read Eliminator kit (SRE)
Genomic DNA was extracted from 50 mg of G0-F100 and G73 male flies using the Nanobind Tissue Big DNA Kit (Circulomics SKU NB-900-701-01) with Insect buffer (PL1) as lysis buffer. Briefly, samples were homogenized in 200 µl CT buffer with a pellet pestle, centrifuged at 16,000 × g at 4°C for 2min. Pellets were washed in 1 ml CT buffer and centrifuged at 16,000 x g at 4°C for 2min. Pellets were resuspended in 200µl PL1 lysis buffer (Circulomics Aux. Kit) supplemented with 20µl proteinase K and incubated in a thermomixer at 55°C and 900 rpm for 1 h. Twenty microliters RNase A was added and incubation was continued for 15 min. Samples were centrifuged at 16,000× g for 5 min, and supernatants passed through a 70µm filter and collected in a 1.5 ml DNA LoBind microcentrifuge tube. Fifty microliters BL3, a Nanobind disk, and 400 µl isopropanol were added to the samples. Incubation and washes were done according to the manufacturers' instructions. Elution was performed with 50 µl EB buffer. The genomic DNA concentration was estimated with a Nanodrop spectrophotometer (Thermo Fisher Scientific) and a Qubit Fluorometer with the Qubit dsDNA BR Assay Kit (Thermo Fisher Scientific), according to the manufacturer's instructions. To get rid of reads <25 kb, genomic DNA samples with concentration adjusted between 50 and 150 ng/µl in 60 µl were treated with the SRE Kit, 25 kb cutoff (Circulomics SKU SS-100-101-01), according to the manufacturer's instructions, and eluted in 50 µl EB.

Library preparation for G73-SRE LR sequencing (depth 82×)
Libraries were prepared on 1µg genomic SRE DNA with the SQK-LSK110 Kit (Nanopore) following the manufacturer's instructions. Twelve microliters of each library, 37.5µl of sequencing buffer II (SBII), and 25.5 µl of loading beads II (LBII) were loaded onto the R9.4.1 flow cell. Base-calling was performed using the Guppy v5 hac model.

PCR amplification of long DNA molecules
PCR was performed with 60 ng G0-F100 or G73 genomic DNA in a final volume of 20µl containing 10µl Platinum Hot-Start Green Master Mix 2X (Invitrogen 14001-012) and 0.2 µM final concentration of each primer (see Additional file 1: Table S1 for sequences). Annealing was performed at 60°C for 15 s and elongation was adapted to the amplicon size (15 s/kb). Between 3 and 5 µl of each PCR reaction were loaded on a 1% agarose gel with EtBr.
For droplet generation, 20 μl of each mixture was transferred to the center line of a DG8 droplet generator cartridge (Bio-Rad Laboratories). Then, 70 μl of droplet generation oil for the probe (Bio-Rad Laboratories) was injected in the cartridge bottom line and droplets were produced using a QX200 Droplet Generator (Bio-Rad Laboratories). Droplets (40 µl) were transferred to a 96-well plate and endpoint PCR was performed on a T100 Thermal Cycler (Bio-Rad Laboratories) using the following cycling protocol: enzyme activation at 95 °C for 10 min, followed by 40 cycles of 95 °C for 30 s and 60 °C for 1 min (ramp rate settings of 2°C/s), followed by a droplet stabilization step at 98 °C for 10 min, and a final hold step at 4 °C. Droplet measurement was performed on a QX200 Droplet Reader (Bio-Rad Laboratories) and data analyzed with the QuantaSoft software (Bio-Rad Laboratories). Two different fluorescent signals were measured: (i) amplification of the TE inserted in a given locus sequence (TE in ) or amplification of the same locus without TE (TE out ) in the FAM channel, and (ii) amplification of the RpL32 housekeeping gene in the HEX channel. The proportion of TE in or TE out to Rpl32 was calculated as follows: TE in /Rpl32 or TE out /Rpl32. Each value was expressed as copies/µl. Frequency was calculated with the following formula: frequency in %= (TE in / Rpl32×100)/(TE in /Rpl32 + TE out /Rpl3).